52 research outputs found
Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions
Robustly determining the optimal number of clusters in a data set is an
essential factor in a wide range of applications. Cluster enumeration becomes
challenging when the true underlying structure in the observed data is
corrupted by heavy-tailed noise and outliers. Recently, Bayesian cluster
enumeration criteria have been derived by formulating cluster enumeration as
maximization of the posterior probability of candidate models. This article
generalizes robust Bayesian cluster enumeration so that it can be used with an
arbitrary Real Elliptically Symmetric (RES) distributed mixture model. Our
framework also covers the case of M-estimators that allow for mixture models,
which are decoupled from a specific probability distribution. Examples of
Huber's and Tukey's M-estimators are discussed. We derive a robust criterion
for data sets with finite sample size, and also provide an asymptotic
approximation to reduce the computational cost at large sample sizes. The
algorithms are applied to simulated and real-world data sets, including
radar-based person identification, and show a significant robustness
improvement in comparison to existing methods.
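To illustrate the Huber and Tukey M-estimators named in the abstract above, the following is a minimal sketch for the much simpler scalar location problem, not the article's RES mixture-model criterion. The function names, the fixed MAD-based scale, and the iteration count are assumptions made for this illustration; the tuning constants 1.345 and 4.685 are the conventional choices for roughly 95% efficiency under Gaussian data.

```python
import statistics


def huber_weight(r, c=1.345):
    """Huber weight: 1 inside [-c, c], decaying as c/|r| outside."""
    a = abs(r)
    return 1.0 if a <= c else c / a


def tukey_weight(r, c=4.685):
    """Tukey biweight: smooth down-weighting, exactly 0 beyond c."""
    return (1.0 - (r / c) ** 2) ** 2 if abs(r) <= c else 0.0


def m_location(xs, weight, iters=50):
    """Location M-estimate via iteratively reweighted averaging.

    Assumption for this sketch: the scale is fixed at the normalized
    median absolute deviation (MAD) of the initial median fit.
    """
    mu = statistics.median(xs)
    scale = 1.4826 * statistics.median([abs(x - mu) for x in xs])
    if scale == 0.0:
        return mu  # more than half the points coincide with the median
    for _ in range(iters):
        ws = [weight((x - mu) / scale) for x in xs]
        total = sum(ws)
        if total == 0.0:
            break  # every point rejected; keep the last estimate
        mu = sum(w * x for w, x in zip(ws, xs)) / total
    return mu


# Five well-behaved samples and one gross outlier (made-up data).
data = [9.8, 9.9, 10.0, 10.1, 10.2, 100.0]
plain_mean = statistics.fmean(data)
huber_est = m_location(data, huber_weight)
tukey_est = m_location(data, tukey_weight)
```

The sample mean is pulled far towards the outlier, whereas both M-estimates stay close to 10. Huber's weight only shrinks the outlier's influence, while Tukey's weight removes it entirely.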
Gravitational Clustering: A Simple, Robust and Adaptive Approach for Distributed Networks
Distributed signal processing for wireless sensor networks enables
different devices to cooperate to solve different signal processing tasks. A
crucial first step is to answer the question: who observes what? Recently,
several distributed algorithms have been proposed that frame the
signal/object labelling problem in terms of cluster analysis after extracting
source-specific features; however, the number of clusters is assumed to be
known. We propose a new method called Gravitational Clustering (GC) to
adaptively estimate the time-varying number of clusters based on a set of
feature vectors. The key idea is to exploit the physical principle of
gravitational force between mass units: streaming-in feature vectors are
considered as mass units of fixed position in the feature space, around which
mobile mass units are injected at each time instant. The cluster enumeration
exploits the fact that the highest attraction on the mobile mass units is
exerted by regions with a high density of feature vectors, i.e., gravitational
clusters. By sharing estimates among neighboring nodes via a
diffusion-adaptation scheme, cooperative and distributed cluster enumeration is
achieved. Numerical experiments concerning robustness against outliers,
convergence and computational complexity are conducted. The application in a
distributed cooperative multi-view camera network illustrates the applicability
to real-world problems. Comment: 12 pages, 9 figures
Bayesian Cluster Enumeration Criterion for Unsupervised Learning
We derive a new Bayesian Information Criterion (BIC) by formulating the
problem of estimating the number of clusters in an observed data set as
maximization of the posterior probability of the candidate models. Given that
some mild assumptions are satisfied, we provide a general BIC expression for a
broad class of data distributions. This serves as a starting point when
deriving the BIC for specific distributions. Along this line, we provide a
closed-form BIC expression for multivariate Gaussian distributed variables. We
show that incorporating the data structure of the clustering problem into the
derivation of the BIC results in an expression whose penalty term is different
from that of the original BIC. We propose a two-step cluster enumeration
algorithm. First, a model-based unsupervised learning algorithm partitions the
data according to a given set of candidate models. Subsequently, the number of
clusters is determined as the one associated with the model for which the
proposed BIC is maximal. The performance of the proposed two-step algorithm is
tested using synthetic and real data sets. Comment: 14 pages, 7 figures
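The two-step procedure described in the abstract above (partition the data for each candidate model, then pick the candidate with the maximal criterion) can be sketched as follows. This sketch makes two loudly stated substitutions: step one uses an exact least-squares segmentation of one-dimensional data instead of a model-based learning algorithm, and step two uses the classical Schwarz BIC rather than the new criterion derived in the paper, whose penalty term differs. All function names, the variance floor, and the toy data are assumptions for illustration.

```python
import math


def segment_1d(xs, k):
    """Step 1 (stand-in): split sorted 1-D data into k contiguous
    clusters with minimal within-cluster squared error (dynamic program)."""
    n = len(xs)
    s1, s2 = [0.0], [0.0]  # prefix sums of x and x^2
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):  # squared error of xs[i:j] around its mean
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                v = best[c - 1][i] + cost(i, j)
                if v < best[c][j]:
                    best[c][j], cut[c][j] = v, i
    clusters, j = [], n
    for c in range(k, 0, -1):  # backtrack the optimal cut points
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    return clusters[::-1]


def schwarz_bic(clusters, var_floor=1e-3):
    """Step 2 (stand-in): classical BIC in 'larger is better' form for a
    hard-assigned 1-D Gaussian mixture. var_floor is an assumed guard."""
    n = sum(len(c) for c in clusters)
    log_lik = 0.0
    for c in clusters:
        m = len(c)
        mean = sum(c) / m
        ss = sum((x - mean) ** 2 for x in c)
        var = max(ss / m, var_floor)
        log_lik += m * (math.log(m / n) - 0.5 * math.log(2 * math.pi * var))
        log_lik -= ss / (2 * var)
    q = 3 * len(clusters) - 1  # means, variances, and free mixing weights
    return log_lik - 0.5 * q * math.log(n)


def enumerate_clusters(xs, k_max):
    """Score every candidate k and return the one with the maximal BIC."""
    xs = sorted(xs)
    scores = {k: schwarz_bic(segment_1d(xs, k)) for k in range(1, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy data: three tight groups centred at 0, 10, and 20.
data = [c + d for c in (0.0, 10.0, 20.0) for d in (-1.0, -0.5, 0.0, 0.5, 1.0)]
best_k, scores = enumerate_clusters(data, k_max=5)
```

On this toy data the criterion peaks at three clusters: fewer clusters lose a large amount of likelihood, while more clusters gain too little likelihood to offset the growing penalty.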
Real Elliptically Skewed Distributions and Their Application to Robust Cluster Analysis
This article proposes a new class of Real Elliptically Skewed (RESK)
distributions and associated clustering algorithms that allow for integrating
robustness and skewness into a single unified cluster analysis framework.
Non-symmetrically distributed and heavy-tailed data clusters have been reported
in a variety of real-world applications. Robustness is essential because a few
outlying observations can severely obscure the cluster structure. The RESK
distributions are a generalization of the Real Elliptically Symmetric (RES)
distributions. To estimate the cluster parameters and memberships, we derive an
expectation maximization (EM) algorithm for arbitrary RESK distributions.
Special attention is given to a new robust skew-Huber M-estimator, which is
also the maximum likelihood estimator (MLE) for the skew-Huber distribution
that belongs to the RESK class. Numerical experiments on simulated and
real-world data confirm the usefulness of the proposed methods for skewed and
heavy-tailed data sets.
Robust and Efficient Aggregation for Distributed Learning
Distributed learning paradigms, such as federated and decentralized learning,
allow for the coordination of models across a collection of agents without
the need to exchange raw data. Instead, agents compute model updates locally
based on their available data and subsequently share the updated model with a
parameter server or their peers. This is followed by an aggregation step, which
traditionally takes the form of a (weighted) average. Distributed learning
schemes based on averaging are known to be susceptible to outliers. A single
malicious agent is able to drive an averaging-based distributed learning
algorithm to an arbitrarily poor model. This has motivated the development of
robust aggregation schemes, which are based on variations of the median and
trimmed mean. While such procedures ensure robustness to outliers and malicious
behavior, they come at the cost of significantly reduced sample efficiency.
This means that current robust aggregation schemes require significantly higher
agent participation rates to achieve a given level of performance than their
mean-based counterparts in non-contaminated settings. In this work we remedy
this drawback by developing statistically efficient and robust aggregation
schemes for distributed learning.
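The sensitivity of averaging described in the abstract above can be seen in a small sketch that compares the plain mean with the two baseline robust aggregators it mentions, the coordinate-wise median and the trimmed mean. This is not the efficient scheme the work proposes; the function names and the toy updates are assumptions for illustration.

```python
import statistics


def mean_agg(updates):
    """Coordinate-wise average of the agents' update vectors."""
    return [statistics.fmean(col) for col in zip(*updates)]


def median_agg(updates):
    """Coordinate-wise median of the agents' update vectors."""
    return [statistics.median(col) for col in zip(*updates)]


def trimmed_mean_agg(updates, trim=1):
    """Coordinate-wise mean after dropping the `trim` smallest and
    `trim` largest values in every coordinate."""
    if 2 * trim >= len(updates):
        raise ValueError("trim removes every update")
    out = []
    for col in zip(*updates):
        kept = sorted(col)[trim:len(col) - trim]
        out.append(statistics.fmean(kept))
    return out


# Four honest agents report similar updates; one malicious agent does not.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
malicious = [[1000.0, 1000.0]]
updates = honest + malicious

avg = mean_agg(updates)          # dragged far away by one agent
med = median_agg(updates)        # stays at the honest consensus
trm = trimmed_mean_agg(updates)  # discards the extreme value per coordinate
```

A single corrupted update moves the average by hundreds of units, while the median and the trimmed mean stay near the honest updates. The price is that they discard or down-weight some honest information as well, which is the efficiency loss the abstract refers to.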
Identifying the Complete Correlation Structure in Large-Scale High-Dimensional Data Sets with Local False Discovery Rates
The identification of the dependent components in multiple data sets is a
fundamental problem in many practical applications. The challenge in these
applications is that often the data sets are high-dimensional with few
observations or available samples and contain latent components with unknown
probability distributions. A novel mathematical formulation of this problem is
proposed, which enables the inference of the underlying correlation structure
with strict false positive control. In particular, the false discovery rate is
controlled at a pre-defined threshold on two levels simultaneously. The
deployed test statistics originate in the sample coherence matrix. The required
probability models are learned from the data using the bootstrap. Local false
discovery rates are used to solve the multiple hypothesis testing problem.
Compared to the existing techniques in the literature, the developed technique
does not assume an a priori correlation structure and works well when the number
of data sets is large while the number of observations is small. In addition,
it can handle the presence of distributional uncertainties, heavy-tailed noise,
and outliers. Comment: Preliminary version
Shuffled Multi-Channel Sparse Signal Recovery
Mismatches between samples and their respective channel or target commonly
arise in several real-world applications. For instance, whole-brain calcium
imaging of freely moving organisms, multiple-target tracking or multi-person
contactless vital sign monitoring may be severely affected by mismatched
sample-channel assignments. To systematically address this fundamental problem,
we pose it as a signal reconstruction problem where we have lost
correspondences between the samples and their respective channels. Assuming
that we have a sensing matrix for the underlying signals, we show that the
problem is equivalent to a structured unlabeled sensing problem, and establish
sufficient conditions for unique recovery. To the best of our knowledge, a
sampling result for the reconstruction of shuffled multi-channel signals has
not been considered in the literature and existing methods for unlabeled
sensing cannot be directly applied. We extend our results to the case where the
signals admit a sparse representation in an overcomplete dictionary (i.e., the
sensing matrix is not precisely known), and derive sufficient conditions for
the reconstruction of shuffled sparse signals. We propose a robust
reconstruction method that combines sparse signal recovery with robust linear
regression for the two-channel case. The performance and robustness of the
proposed approach are illustrated in an application related to whole-brain
calcium imaging. The proposed methodology can be generalized to sparse signal
representations other than the ones considered in this work to be applied in a
variety of real-world problems with imprecise measurement or channel
assignment. Comment: Submitted to TS